Distributional Clustering of Words for Text Classification

Authors

  • L. Douglas Baker
  • Andrew Kachites McCallum
Abstract

This paper describes the application of Distributional Clustering to document classification. This approach clusters words into groups based on the distribution of class labels associated with each word. Thus, unlike some other unsupervised dimensionality-reduction techniques, such as Latent Semantic Indexing, we are able to compress the feature space much more aggressively while still maintaining high document classification accuracy. Experimental results obtained on three real-world data sets show that we can reduce the feature dimensionality by three orders of magnitude and lose only ... accuracy, significantly better than Latent Semantic Indexing, class-based clustering, feature selection by mutual information, or Markov-blanket-based feature selection. We also show that less aggressive clustering sometimes results in improved classification accuracy over classification without clustering.
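
To make the idea in the abstract concrete, here is a minimal Python sketch of clustering words by the distribution of class labels associated with each word. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the similarity measure or the merging procedure, so the sketch assumes a greedy bottom-up merge driven by a weighted KL divergence of each cluster to the merged distribution. The names `kl`, `merge_cost`, and `cluster_words`, and the toy counts, are hypothetical.

```python
import math
from itertools import combinations


def kl(p, q):
    """KL divergence D(p || q) in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def merge_cost(a, b):
    """Weighted KL divergence of two clusters to their merged class distribution.

    Assumption: this measure is chosen for illustration only.
    """
    total = a["weight"] + b["weight"]
    wa, wb = a["weight"] / total, b["weight"] / total
    mean = [wa * pa + wb * pb for pa, pb in zip(a["dist"], b["dist"])]
    return wa * kl(a["dist"], mean) + wb * kl(b["dist"], mean)


def cluster_words(class_counts, num_clusters):
    """Greedily merge words until only num_clusters clusters remain.

    class_counts maps each word to a list of per-class occurrence counts.
    Assumption: every word occurs at least once in the training data.
    """
    clusters = []
    for word, counts in class_counts.items():
        n = sum(counts)
        # Each word starts as its own cluster with its class distribution P(C | w).
        clusters.append({"words": {word}, "weight": n,
                         "dist": [c / n for c in counts]})

    while len(clusters) > num_clusters:
        # Pick the pair of clusters whose class distributions are most alike.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        a, b = clusters[i], clusters[j]
        total = a["weight"] + b["weight"]
        merged = {
            "words": a["words"] | b["words"],
            "weight": total,
            # The merged distribution is the count-weighted average of the two.
            "dist": [(a["weight"] * pa + b["weight"] * pb) / total
                     for pa, pb in zip(a["dist"], b["dist"])],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


if __name__ == "__main__":
    # Toy counts over two classes (sports, politics); purely illustrative.
    toy = {"goal": [9, 1], "referee": [8, 2], "ballot": [1, 9], "senate": [2, 8]}
    for cluster in cluster_words(toy, 2):
        print(sorted(cluster["words"]), [round(p, 2) for p in cluster["dist"]])
```

Each resulting cluster can then stand in for its member words as a single feature, which is how the feature space shrinks. The exhaustive pair search is quadratic in the number of clusters per merge, so this sketch is only suitable for small vocabularies.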

Similar resources

Graph-Based Clustering for Semantic Classification of Onomatopoetic Words

This paper presents a method for semantic classication of onomatopoetic words like “ひゅーひゅー (hum)” and “からん ころん (clip clop)” which exist in every language, especially Japanese being rich in onomatopoetic words. We used a graph-based clustering algorithm called Newman clustering. The algorithm calculates a simple quality function to test whether a particular division is meaningful. The quality f...

Athena: Mining-based Interactive Management of Text Databases

We describe Athena: a system for creating, exploiting, and maintaining a hierarchical arrangement of textual documents through interactive mining-based operations. Requirements of any such system include speed and minimal end-user effort. Athena satisfies these requirements through linear-time classification and clustering engines which are applied interactively to speed the development of accurat...

Athena: Mining-Based Interactive Management of Text Database

We describe Athena: a system for creating, exploiting, and maintaining a hierarchy of textual documents through interactive mining-based operations. Requirements of any such system include speed and minimal end-user effort. Athena satisfies these requirements through linear-time classification and clustering engines which are applied interactively to speed the development of accurate models. Naive ...

A Term Association Translation Model for Naive Bayes Text Classification

Text classification (TC) has long been an important research topic in information retrieval (IR) related areas. In the literature, the bag-of-words (BoW) model has been widely used to represent a document in text classification and many other applications. However, BoW, which ignores the relationships between terms, offers a rather poor document representation. Some previous research has shown tha...

Hypertext Categorization using Hyperlink Patterns and Meta Data

Hypertext poses new text classification research challenges as hyperlinks, content of linked documents, and meta data about related web sites all provide richer sources of information for hypertext classification that are not available in traditional text classification. We investigate the use of such information for representing web sites, and the effectiveness of different classifiers (Naive Bayes,...

Publication date: 2010